Conceptual Clustering of Heterogeneous Distributed Databases
Authors
Abstract
With an increasing number of databases becoming available on the Internet, there is a growing opportunity to globalise knowledge discovery and learn general patterns, rather than restricting learning to specific databases whose rules may not generalise. Clustering distributed databases facilitates the learning of new concepts that characterise the common features of, and differences between, datasets. We are concerned here with clustering databases that hold aggregate count data on a set of attributes that have been classified according to heterogeneous classification schemes. Such aggregates are commonly used to summarise very large databases, such as those encountered in data warehousing, large-scale transaction management, and statistical databases. To measure the difference between aggregates we use two distance measures: the Euclidean distance and the Kullback-Leibler information divergence. A hybrid of the two, which uses Kullback-Leibler to learn the class probabilities and the Euclidean distance as the corresponding distance measure, looks particularly promising in terms of both accuracy and scalability. These measures are evaluated using synthetic data. Important applications of the work include the clustering of heterogeneous customer databases for the discovery of new marketing concepts and the clustering of medical databases for the discovery of new epidemiological concepts.
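The two distance measures named in the abstract can be illustrated with a minimal sketch. It is not the authors' method: it assumes, for simplicity, that all databases already share one classification scheme (the paper addresses the harder heterogeneous case), and the function names, the smoothing constant, and the example counts are hypothetical.

```python
import math

def to_probabilities(counts, smoothing=1e-9):
    """Turn aggregate counts into class probabilities.

    The small smoothing term (an assumption of this sketch) keeps every
    probability positive so the Kullback-Leibler divergence stays finite.
    """
    total = sum(counts)
    k = len(counts)
    return [(c + smoothing) / (total + smoothing * k) for c in counts]

def euclidean_distance(p, q):
    """Euclidean distance between two probability vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q); note it is not symmetric."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

# Hypothetical aggregate counts over the same three classes.
db_a = to_probabilities([120, 60, 20])
db_b = to_probabilities([100, 70, 30])
db_c = to_probabilities([10, 40, 150])

print("Euclidean a-b:", euclidean_distance(db_a, db_b))
print("Euclidean a-c:", euclidean_distance(db_a, db_c))
print("KL a||b:", kl_divergence(db_a, db_b))
print("KL a||c:", kl_divergence(db_a, db_c))
```

In this toy data, database `db_b` has a class distribution close to `db_a`, while `db_c` is very different, so both measures place `db_a` nearer to `db_b`. A clustering procedure would group databases using such pairwise values.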
Similar resources
Model-based Clustering on Semantically Heterogeneous Distributed Databases on the Internet
The vision of the Semantic Web brings challenges to knowledge discovery on databases in such a heterogeneous, distributed, open environment. The databases are developed independently with semantic information embedded, and they are heterogeneous with respect to data granularity, ontology/scheme information, etc. Distributed knowledge discovery (DKD) methods are required to take semantic info...
Relational Text Mining and Visualization
Discovering hidden patterns in distributed heterogeneous textual databases and unstructured data is a new challenge in data mining. Traditional data mining often assumes that preprocessing is already done and that homogeneous data are available at the needed level. For distributed heterogeneous textual data this is not the case. Complex relations between items/entities (e.g., relations between people i...
Distributed clustering and local regression for knowledge discovery in multiple spatial databases
Many large-scale spatial data analysis problems involve an investigation of relationships in heterogeneous databases. In such situations, instead of making predictions uniformly across entire spatial data sets, in a previous study we used clustering for identifying similar spatial regions and then constructed local regression models describing the relationship between data characteristics and t...
Optimization of majority protocol for controlling transactions concurrency in distributed databases by multi-agent systems
In this paper, we propose a new concurrency control algorithm based on multi-agent systems which is an extension of majority protocol. Then, we suggest a clustering approach to get better results in reliability, decreasing message passing and algorithm’s runtime. Here, we consider n different transactions working on non-conflict data items. Considering execution efficiency of some different...
A Conceptual Level Design Methodology for Probabilistic Relational Databases
When multiple heterogeneous databases show different values for the same data item, its actual value is not known with certainty. Developing corporate data warehouses, which consolidate data from multiple heterogeneous data sources, has become an important issue in designing modern business information systems. Probabilistic relational databases have extended from the relational database model...